Coastal barometer blog | Coastal barometer website |
Here I will run the exploration of the response variables (aquaculture production and fisheries production) and of the covariates
library(vroom)
library(here)
library(lattice)
library(cowplot)seafood <-read.csv(here("prep", "output_data", "spfood_allvars.csv"))Calculate total sea food production here as well. I also create a log-transformed fisheries catch, because it is highly skewed to the left (see histograms below)
seafood_prep <- seafood |>
rowwise() |>
mutate(totprod = sum(aqua_prod, total_catch, na.rm=T)) |>
mutate(aqua_prop = aqua_prod/totprod) |>
mutate(fish_catch_log = log(total_catch + 0.001)) |>
mutate(aqua_prod_log = log(aqua_prod + 0.001))I will use extensively the approaches and functions from the two Highland statistics ltd books: * Mixed effects models and extensions in ecology with R. (2009). Zuur, AF, Ieno, EN, Walker, N, Saveliev, AA, and Smith, GM. Springer
and
Cleveland dotplots of the variables
Mydotplot(seafood_prep[,c(5,6,8,9,10,11,12,13)]) We observe an excess of zeros or close-to-zero values in the population, population growth, unemployed, and total catch variables.
Check the % of zeroes in the variables:
zeros_prop <- function(x){ #x is a numerical vector
rows = sum(!is.na(x))
zeros = sum(x == 0, na.rm = T)
prop = zeros/rows
prop
}myvars <- seafood_prep[,c(5,6,7,8,9,10,11)]
map(myvars, ~zeros_prop(.x))## $aqua_prod
## [1] 0.00119
##
## $total_catch
## [1] 0.154
##
## $sea_area
## [1] 0
##
## $wageco
## [1] 0
##
## $population
## [1] 0.00928
##
## $popgrowth
## [1] 0.0441
##
## $unemployed
## [1] 0
Only fisheries has 15% of zeros, other variables do not suffer from zero inflation
Check out the distribution of variables:
varnames <- list("aquaculture", "fisheries", "p90/p10", "population", "population growth","unemployed", "seaarea")
par(mfrow = c(2,4))
hist <- map2(myvars,varnames, ~ hist(.x, main = .y, col = "seagreen", xlab = " ", ylab = ""))Most of the variables are left-skewed, but fisheries may need to be transformed, or later, the response variable aquaculture/(aquaculture + fisheries) may need to be transformed.
The boxplots of the variables, to check if there are outliers (formally): yes, since the variables are all left-skewed
par(mfrow = c(2,4))
map2(myvars,varnames, ~ boxplot(.x, main = .y, col = "seagreen", xlab = " "))Let’s take a look at the variable aqua_prop which is a proportion of aquaculture in the total seafood production and at the log of the fisheries catches
par(mfrow=c(1,3))
hist(seafood_prep$aqua_prop, main = "Aquaculture proportion", xlab = "", col = "orchid")
hist(seafood_prep$fish_catch_log, main = "Log of fisheries catches", xlab = "", col = "orchid")
hist(seafood_prep$aqua_prod_log, main = "Log of aquaculture production", xlab = "", col = "orchid") It is highly skewed to the right, a lot of values are well below 1. Should I transform this variable? Not sure yet.
Mypairs(myvars) Clear association is only seen for the population and growth, and between the number of unemployed and population growth (positive relationship in both cases).
MyX <- c("year", "sea_area", "wageco", "population", "popgrowth" , "unemployed")
MyMultipanel.ggp2(Z = seafood_prep,
varx = MyX,
vary = "aqua_prop",
ylab = "aquaculture % in seafood production",
addSmoother = TRUE,
addRegressionLine = FALSE,
addHorizontalLine = FALSE) ## Warning: Removed 192 rows containing non-finite values (stat_smooth).
## Warning: Removed 192 rows containing missing values (geom_point).
No clear associations or patterns, but i need to consider transformation of a response variable.
What if aquaculture production alone would be used as a response variable? Let’s see also if fisheries catch has anything to do with aquaculture produciton
MyMultipanel.ggp2(Z = seafood_prep,
varx = MyX,
vary = "aqua_prod",
ylab = "aquaculture production",
addSmoother = TRUE,
addRegressionLine = FALSE,
addHorizontalLine = FALSE) ## Warning: Removed 192 rows containing non-finite values (stat_smooth).
## Warning: Removed 192 rows containing missing values (geom_point).
This looks more interesting! Relationships are close to linear, but for population and wage coefficient maybe not.
Might be interesting to examine their associations statistically
fish_aqua <-ggplot(
data = seafood_prep,
mapping = aes(x = fish_catch_log, y = aqua_prod)) +
geom_point() +
geom_smooth()
fish_aqua## Warning: Removed 25 rows containing non-finite values (stat_smooth).
## Warning: Removed 25 rows containing missing values (geom_point).
MyMultipanel.ggp2(Z = seafood_prep,
varx = MyX,
vary = "fish_catch_log",
ylab = "fisheries catch",
addSmoother = TRUE,
addRegressionLine = FALSE,
addHorizontalLine = FALSE) ## Warning: Removed 42 rows containing non-finite values (stat_smooth).
## Warning: Removed 42 rows containing missing values (geom_point).
I will need to transform fisheries catch to log(catch) for the analysis
MyX2 <- c("year", "sea_area", "wageco", "population", "popgrowth" , "unemployed", "fish_catch_log")
MyMultipanel.ggp2(Z = seafood_prep,
varx = MyX2,
vary = "aqua_prod",
ylab = "aquaculture production",
addSmoother = TRUE,
addRegressionLine = FALSE,
addHorizontalLine = FALSE) ## Warning: Removed 217 rows containing non-finite values (stat_smooth).
## Warning: Removed 217 rows containing missing values (geom_point).
Just to see if the distribution of fisheries and aquaculture production over years look very different when plotted per municipality. There is a clear increase in the production over years!
ggplot(data = seafood_prep) +
geom_point(mapping = aes(x = year, y = aqua_prod), size = 0.5) +
geom_smooth(mapping = aes(x = year, y = aqua_prod), size = 0.4) +
facet_wrap( municip ~ .) +
theme(
axis.text.x = element_text(size = 10, angle = 90)) +
theme_half_open() +
theme(legend.position = "none") Same but for fisheries
ggplot(data = seafood_prep) +
geom_point(mapping = aes(x = year, y = fish_catch_log), size = 0.5) +
geom_smooth(mapping = aes(x = year, y = fish_catch_log), size = 0.4) +
facet_wrap( municip ~ .) +
theme(
axis.text.x = element_text(size = 10, angle = 90),
legend.position = "none") +
theme_half_open() # Verifying dependence in the response variables I am already quite sure that both responses will be temporally and spatially dependent. But to test for that formally, we can use an autocorrelation test:
seafood_temp <- arrange(seafood_prep, year)
acf(seafood_temp$aqua_prod, na.action = na.pass)acf(seafood_temp$total_catch)